In this analysis, I will build different types of machine learning (ML) models to predict the rental price of apartments in Madrid. The key features used for prediction include the surface area, number of bathrooms, number of rooms, and the neighborhood where the apartment is located.
The data for this analysis was obtained through a web scraping project from the real estate website Fotocasa.com, which is a well-known and widely used platform in Spain for property listings.
library(dplyr)
library(caret)
library(ModelMatrixModel)
library(mgcv)
library(ggplot2)
library(sf)
library(car)
library(corrplot)
library(MASS)
library(RcmdrMisc)
library(randomForest)
library(gbm)
library(mpae)
library(knitr)
library(pdp)
library(gridExtra)
library(glmnet)
data<-read.csv2("/Users/aa/Downloads/apartment_data.csv",sep=",",header=T)
madrid<-data[251:nrow(data),]
summary(madrid)
## Price Square.Meters Rooms Bathrooms
## Length:5489 Min. : 1.00 Min. : 1.000 Min. : 1.000
## Class :character 1st Qu.: 45.00 1st Qu.: 1.000 1st Qu.: 1.000
## Mode :character Median : 70.00 Median : 2.000 Median : 1.000
## Mean : 85.55 Mean : 2.195 Mean : 8.502
## 3rd Qu.: 105.00 3rd Qu.: 3.000 3rd Qu.: 2.000
## Max. :8300.00 Max. :430.000 Max. :695.000
## NA's :763 NA's :2 NA's :10
## Zona
## Length:5489
## Class :character
## Mode :character
##
##
##
##
glimpse(madrid)
## Rows: 5,489
## Columns: 5
## $ Price <chr> "990", "1.400", "1.165", "1.800", "1.110", "1.206", "11.…
## $ Square.Meters <int> 68, 71, 55, 30, 390, 250, 82, 125, 90, 82, 152, 70, 42, …
## $ Rooms <int> 2, 1, 3, 1, 3, 3, 2, 2, 3, 2, 3, 2, 1, 3, 1, 2, 1, 3, 1,…
## $ Bathrooms <int> 2, 1, 1, 1, 4, 4, 2, 2, 1, 2, 3, 1, 1, 2, 1, 2, 1, 2, 34…
## $ Zona <chr> "Tetuán", "Salamanca", "Carabanchel", "Centro", "Caraban…
From the summary function we can obtain plenty of information about the data. The first step is to change the variable type of "Price", since it is essential to have it as numeric. Next, there is a noticeable amount of NA values, which would be problematic in later stages of the analysis, so it is better to remove those rows from the dataset.
Afterwards, it is essential to analyze the distribution of the variables and their potential influence on the target variable. For instance, the relationship between the number of rooms and the rent price should be examined in detail.
The exploratory data analysis will include various graphical representations such as histograms, scatter plots, and boxplots to visualize the relationships between the variables.
madrid$Price<-as.numeric(gsub("\\.","",madrid$Price))
madrid<-madrid%>%
filter(Zona!="Capital")
madrid <- madrid %>%
mutate(Zona = ifelse(Zona %in% c("Barrio de Salamanca", "Salamanca"), "Salamanca", Zona)) %>%
mutate(Zona = ifelse(Zona %in% c("Blas", "San Blas"), "San Blas - Canillejas", Zona)) %>%
mutate(Zona = ifelse(Zona %in% c("Vallecas","Villa de Vallecas"), "Villa de Vallecas", Zona)) %>%
mutate(Zona = ifelse(Zona %in% c("Lineal", "Ciudad Lineal"), "Ciudad Lineal", Zona)) %>%
mutate(Zona = ifelse(Zona %in% c("Moncloa - Aravaca", "Aravaca"), "Moncloa - Aravaca", Zona)) %>%
filter(Bathrooms < 10) %>%
filter(Square.Meters > 10) %>%
mutate(Zona = as.factor(Zona))
madrid<-na.omit(madrid)
summary(madrid)
## Price Square.Meters Rooms Bathrooms
## Min. : 450 Min. : 11.00 Min. : 1.000 Min. :1.000
## 1st Qu.: 1400 1st Qu.: 55.00 1st Qu.: 1.000 1st Qu.:1.000
## Median : 2100 Median : 75.00 Median : 2.000 Median :1.000
## Mean : 2575 Mean : 99.17 Mean : 2.136 Mean :1.644
## 3rd Qu.: 3172 3rd Qu.: 114.00 3rd Qu.: 3.000 3rd Qu.:2.000
## Max. :37000 Max. :8300.00 Max. :10.000 Max. :8.000
##
## Zona
## Centro :1046
## Salamanca: 587
## Chamberí : 532
## Tetuán : 278
## Chamartín: 252
## Retiro : 236
## (Other) :1123
glimpse(madrid )
## Rows: 4,054
## Columns: 5
## $ Price <dbl> 990, 1400, 1165, 1800, 1110, 1206, 11000, 4500, 1250, 18…
## $ Square.Meters <int> 68, 71, 55, 30, 390, 250, 82, 125, 90, 82, 152, 70, 42, …
## $ Rooms <int> 2, 1, 3, 1, 3, 3, 2, 2, 3, 2, 3, 2, 1, 3, 1, 2, 1, 3, 2,…
## $ Bathrooms <int> 2, 1, 1, 1, 4, 4, 2, 2, 1, 2, 3, 1, 1, 2, 1, 2, 1, 2, 1,…
## $ Zona <fct> Tetuán, Salamanca, Carabanchel, Centro, Carabanchel, Sal…
First things first: there are some extremely high values in Price and Square.Meters that look like outliers.
ggplot(madrid, aes(x = Price)) +
geom_histogram(binwidth = 500, fill = "skyblue", color = "black") +
labs(title = "Histogram of Price",
x = "Price",
y = "Frequency") +
theme_classic()
ggplot(madrid, aes(x = Square.Meters)) +
geom_histogram(binwidth = 100, fill = "skyblue", color = "black") +
labs(title = "Histogram of Square.Meters",
x = "Square.Meters",
y = "Frequency") +
theme_classic()
Both histograms are strongly right-skewed, especially Square.Meters, indicating the presence of large outliers.
ggplot(madrid, aes(x = `Square.Meters`, y = Price,color = Zona)) +
geom_point() +
labs(title = "Price vs. Square Meters",
x = "Square.Meters",
y = "Price") +
theme_classic()
Here we can clearly see two large outliers in Square.Meters and three in the Price variable.
Let's start by removing the three largest values of Price.
top_prices_indices <- order(madrid$Price, decreasing = TRUE)[1:3]
madrid$Price[top_prices_indices]
## [1] 37000 25000 25000
madrid <- madrid[-top_prices_indices, ]
These prices are extreme, so it is better to remove them; otherwise they could cause problems for the regression models.
The same procedure is applied to the three largest values of Square.Meters.
top_prices_indices2 <- order(madrid$Square.Meters, decreasing = TRUE)[1:3]
madrid$Square.Meters[top_prices_indices2]
## [1] 8300 7575 1308
madrid <- madrid[-top_prices_indices2, ]
madrid_subset <- madrid[, !names(madrid) %in% "Zona"]
plot(madrid_subset)
There is a roughly linear relationship between some variables, such as Rooms and Square.Meters, and also Bathrooms and Square.Meters. This is not surprising: in general, an apartment with many rooms also needs more surface area. These relationships between predictors can be a problem for linear regression models, so it is important to check for collinearity when using them.
This plot shows the relationship between Price and the other variables in the data set. There is a clear linear trend, but also many outliers, and the points show a lot of variance. To reduce this variability and improve the performance of the models, a log transformation will be applied to the Price column, after which the data will look like this.
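A minimal sketch of the log-scale scatter plot, reusing the ggplot2 styling of the earlier plots:

```r
# Same scatter plot as before, but with Price on the log scale
ggplot(madrid, aes(x = Square.Meters, y = log(Price), color = Zona)) +
  geom_point() +
  labs(title = "log(Price) vs. Square Meters",
       x = "Square.Meters",
       y = "log(Price)") +
  theme_classic()
```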
Now the data is less scattered and more tightly grouped. The trend also looks somewhat different from the previous plot: you could say there is a quadratic relationship between Price and Square.Meters. The interpretation would be that prices increase with surface area, but only up to a point, after which the marginal increase in price becomes smaller.
In the case of the variable Zona, we have 22 districts of the city of Madrid. Here's a map showing the number of apartments in each district.
## Reading layer `Distritos' from data source
## `/Users/aa/Downloads/Distritos/Distritos.shp' using driver `ESRI Shapefile'
## Simple feature collection with 21 features and 6 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 424753.5 ymin: 4462566 xmax: 456033.3 ymax: 4499365
## Projected CRS: ETRS89 / UTM zone 30N
## [1] Tetuán Salamanca Carabanchel
## [4] Centro Retiro Arganzuela
## [7] Hortaleza Chamartín Pardo
## [10] Moncloa - Aravaca Chamberí Puente de Vallecas
## [13] Vicálvaro Ciudad Lineal San Blas - Canillejas
## [16] Villa de Vallecas Moratalaz Usera
## [19] Villaverde Latina Fuencarral - El Pardo
## [22] Barajas
## 22 Levels: Arganzuela Barajas Carabanchel Centro Chamartín ... Villaverde
| Var1 | Freq |
|---|---|
| Arganzuela | 175 |
| Barajas | 21 |
| Carabanchel | 93 |
| Centro | 1045 |
| Chamartín | 252 |
| Chamberí | 530 |
| Ciudad Lineal | 111 |
| Fuencarral - El Pardo | 47 |
| Hortaleza | 78 |
| Latina | 81 |
| Moncloa - Aravaca | 156 |
| Moratalaz | 8 |
| Pardo | 35 |
| Puente de Vallecas | 19 |
| Retiro | 235 |
| Salamanca | 586 |
| San Blas - Canillejas | 73 |
| Tetuán | 278 |
| Usera | 68 |
| Vicálvaro | 21 |
| Villa de Vallecas | 73 |
| Villaverde | 63 |
A good portion of the apartments are located in the city center and nearby areas; on the other hand, cheaper neighborhoods have less presence in the data set. This can explain why the average price is so high (2570.575).
*Note: I think Fotocasa.com has changed the way neighborhood information is displayed; instead of the neighborhood name, listings now show the street name. This could be useful for spatial analysis.
This analysis is carried out from a machine learning perspective. This means that, first, the data will be split into two groups, train and test: the train set is used to fit the models' parameters and the test set to evaluate their performance. Secondly, the aim of the modelling is to maximize the accuracy of the predictions, which means we will be less strict with the models' assumptions.
set.seed(1)
nobs <- nrow(madrid)
itrain <- sample(nobs, 0.8 * nobs)
train <- madrid[itrain, ]
test <- madrid[-itrain, ]
train$Price <- log(train$Price)
test$Price <- log(test$Price)
dim(train)
## [1] 3238 5
dim(test)
## [1] 810 5
Linear regression is a popular and straightforward model that is easy to apply and provides a clear picture of how the variables relate to each other. Its main objective is to find the line that minimizes the following quantity.
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \] Where: \(n\) is the number of observations, \(y_i\) is the actual value, \(\hat{y}_i\) is the predicted value.
In simpler terms, the goal is to identify a line in the variable space that minimizes the differences between the observed values (e.g., actual prices) and the predicted values generated by the model.
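As a tiny illustration of the formula, using made-up numbers (not taken from the dataset):

```r
# Hypothetical observed and predicted prices
y_obs  <- c(990, 1400, 1165)
y_pred <- c(1000, 1350, 1200)

# MSE = mean of the squared differences
mse <- mean((y_obs - y_pred)^2)
mse
## [1] 1275
```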
While linear regression is a great starting point due to its simplicity and interpretability, it is not always the most flexible model for handling diverse data. This limitation arises from the assumptions it makes:
1. Linearity: The relationship between the predictors and the response variable is assumed to be linear. If the true relationship is nonlinear, the model may struggle to fit the data accurately. In this case this is not an issue, since the data appears to show linear relationships, as noted above.
2. Independence: The observations are assumed to be independent of each other.
3. Homoscedasticity: The variance of the residuals (errors) should remain constant across all levels of the predictors.
4. Normality of Residuals: The residuals are assumed to follow a normal distribution.
5. No Multicollinearity: The predictors should not be highly correlated with each other, as multicollinearity can make the model unstable and affect the interpretation of coefficients.
Violating these assumptions can lead to biased estimations or poor predictive performance.
Let's fit the model.
linear_model<-lm(Price~Square.Meters+Rooms+Bathrooms+Zona,data=train)
summary(linear_model)
##
## Call:
## lm(formula = Price ~ Square.Meters + Rooms + Bathrooms + Zona,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.19928 -0.20420 -0.01267 0.21600 1.52373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.8953059 0.0302371 228.041 < 2e-16 ***
## Square.Meters 0.0029653 0.0001584 18.724 < 2e-16 ***
## Rooms 0.0299308 0.0082933 3.609 0.000312 ***
## Bathrooms 0.1938949 0.0122566 15.820 < 2e-16 ***
## ZonaBarajas -0.2502175 0.0880229 -2.843 0.004502 **
## ZonaCarabanchel -0.3197336 0.0483269 -6.616 4.31e-11 ***
## ZonaCentro 0.1964012 0.0302205 6.499 9.34e-11 ***
## ZonaChamartín 0.1453546 0.0364915 3.983 6.95e-05 ***
## ZonaChamberí 0.2900938 0.0324892 8.929 < 2e-16 ***
## ZonaCiudad Lineal -0.1459336 0.0450800 -3.237 0.001219 **
## ZonaFuencarral - El Pardo -0.2038817 0.0604059 -3.375 0.000746 ***
## ZonaHortaleza -0.1411821 0.0500151 -2.823 0.004790 **
## ZonaLatina -0.2258216 0.0476419 -4.740 2.23e-06 ***
## ZonaMoncloa - Aravaca -0.0390976 0.0411054 -0.951 0.341598
## ZonaMoratalaz -0.3797751 0.1519834 -2.499 0.012511 *
## ZonaPardo -0.0662384 0.0670824 -0.987 0.323512
## ZonaPuente de Vallecas -0.4010089 0.0906430 -4.424 1.00e-05 ***
## ZonaRetiro 0.1958101 0.0367823 5.323 1.09e-07 ***
## ZonaSalamanca 0.3424533 0.0320946 10.670 < 2e-16 ***
## ZonaSan Blas - Canillejas -0.3472714 0.0527358 -6.585 5.29e-11 ***
## ZonaTetuán 0.0252028 0.0361668 0.697 0.485947
## ZonaUsera -0.3671235 0.0523111 -7.018 2.73e-12 ***
## ZonaVicálvaro -0.2965868 0.0835456 -3.550 0.000391 ***
## ZonaVilla de Vallecas -0.3394939 0.0537508 -6.316 3.05e-10 ***
## ZonaVillaverde -0.3965639 0.0545057 -7.276 4.31e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3339 on 3213 degrees of freedom
## Multiple R-squared: 0.6525, Adjusted R-squared: 0.6499
## F-statistic: 251.4 on 24 and 3213 DF, p-value: < 2.2e-16
Right off the bat we get a fairly solid model in terms of variable significance. A few areas are not significant, such as Moncloa - Aravaca, Pardo and Tetuán. The R-squared is not too high, but with a few adjustments we can increase it.
Now it's time to check the assumptions made by the model. First, let's analyze the residuals to see whether they are normally distributed.
res<-linear_model$residuals
qqPlot(res)
## 3733 6
## 1356 2715
The residuals do not look normal in the Q-Q plot, so let's run a Shapiro-Wilk test to confirm.
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.98813, p-value = 7.854e-16
The residuals are clearly not normally distributed. Depending on the goal of the analysis this could be an issue. If the goal is prediction, it is not the end of the world and we can still trust the prediction results; but if the goal is inference, it is problematic, since we won't be able to trust the model's outputs.
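A complementary check of the homoscedasticity assumption is base R's built-in diagnostic plots for lm objects; a minimal sketch:

```r
# Residuals vs. fitted values (spots heteroscedasticity) and the normal Q-Q plot
par(mfrow = c(1, 2))
plot(linear_model, which = 1:2)
par(mfrow = c(1, 1))
```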
Let's test for multicollinearity.
madrid_subset <- madrid[, !names(madrid) %in% "Zona"]
mat_cor<-cor(madrid_subset)
corrplot(mat_cor,method = "ellipse")
car::vif(linear_model)
## GVIF Df GVIF^(1/(2*Df))
## Square.Meters 3.156749 1 1.776724
## Rooms 2.195207 1 1.481623
## Bathrooms 3.345565 1 1.829089
## Zona 1.199243 21 1.004335
There does not appear to be much multicollinearity between the variables in the fitted model. Now it's time to use the stepwise function, which selects variables by dropping the non-significant ones.
final_linear_model<-stepwise(linear_model)
##
## Direction: backward/forward
## Criterion: BIC
##
## Start: AIC=-6926.44
## Price ~ Square.Meters + Rooms + Bathrooms + Zona
##
## Df Sum of Sq RSS AIC
## <none> 358.24 -6926.4
## - Rooms 1 1.452 359.69 -6921.4
## - Bathrooms 1 27.903 386.14 -6691.7
## - Square.Meters 1 39.088 397.33 -6599.2
## - Zona 21 136.303 494.54 -6052.1
The final model keeps the variables Square.Meters, Bathrooms, Rooms and Zona.
Looking at the relationships between the variables, I suggest adding the interaction of Square.Meters, Bathrooms and Rooms.
model_with_interaction<-lm(Price~(Square.Meters*Bathrooms*Rooms)+Zona,data=train)
summary(model_with_interaction)
##
## Call:
## lm(formula = Price ~ (Square.Meters * Bathrooms * Rooms) + Zona,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.78910 -0.19632 -0.01664 0.18777 1.46084
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.403e+00 4.551e-02 140.690 < 2e-16 ***
## Square.Meters 8.109e-03 5.282e-04 15.352 < 2e-16 ***
## Bathrooms 3.985e-01 2.440e-02 16.334 < 2e-16 ***
## Rooms 1.228e-01 1.587e-02 7.735 1.37e-14 ***
## ZonaBarajas -2.770e-01 8.271e-02 -3.349 0.000820 ***
## ZonaCarabanchel -2.963e-01 4.549e-02 -6.513 8.49e-11 ***
## ZonaCentro 1.966e-01 2.843e-02 6.914 5.64e-12 ***
## ZonaChamartín 1.434e-01 3.433e-02 4.177 3.03e-05 ***
## ZonaChamberí 2.584e-01 3.059e-02 8.448 < 2e-16 ***
## ZonaCiudad Lineal -1.460e-01 4.237e-02 -3.445 0.000579 ***
## ZonaFuencarral - El Pardo -2.589e-01 5.682e-02 -4.557 5.39e-06 ***
## ZonaHortaleza -1.612e-01 4.703e-02 -3.428 0.000617 ***
## ZonaLatina -1.977e-01 4.481e-02 -4.413 1.05e-05 ***
## ZonaMoncloa - Aravaca 3.918e-03 3.869e-02 0.101 0.919346
## ZonaMoratalaz -3.811e-01 1.428e-01 -2.668 0.007665 **
## ZonaPardo -2.928e-02 6.314e-02 -0.464 0.642814
## ZonaPuente de Vallecas -4.005e-01 8.517e-02 -4.703 2.68e-06 ***
## ZonaRetiro 1.800e-01 3.458e-02 5.206 2.06e-07 ***
## ZonaSalamanca 2.942e-01 3.027e-02 9.720 < 2e-16 ***
## ZonaSan Blas - Canillejas -3.558e-01 4.958e-02 -7.176 8.86e-13 ***
## ZonaTetuán 9.939e-03 3.400e-02 0.292 0.770039
## ZonaUsera -3.460e-01 4.919e-02 -7.034 2.45e-12 ***
## ZonaVicálvaro -3.526e-01 7.856e-02 -4.488 7.42e-06 ***
## ZonaVilla de Vallecas -3.027e-01 5.060e-02 -5.982 2.45e-09 ***
## ZonaVillaverde -3.978e-01 5.132e-02 -7.751 1.21e-14 ***
## Square.Meters:Bathrooms -1.110e-03 1.660e-04 -6.686 2.70e-11 ***
## Square.Meters:Rooms -8.256e-04 1.408e-04 -5.863 5.00e-09 ***
## Bathrooms:Rooms -4.474e-02 5.733e-03 -7.804 8.03e-15 ***
## Square.Meters:Bathrooms:Rooms 1.545e-04 3.352e-05 4.611 4.17e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3137 on 3209 degrees of freedom
## Multiple R-squared: 0.6936, Adjusted R-squared: 0.691
## F-statistic: 259.5 on 28 and 3209 DF, p-value: < 2.2e-16
car::vif(model_with_interaction, type = 'predictor')
## GVIFs computed for predictors
## GVIF Df GVIF^(1/(2*Df)) Interacts With
## Square.Meters 1.31666 7 1.019844 Bathrooms, Rooms
## Bathrooms 1.31666 7 1.019844 Square.Meters, Rooms
## Rooms 1.31666 7 1.019844 Square.Meters, Bathrooms
## Zona 1.31666 21 1.006571 --
## Other Predictors
## Square.Meters Zona
## Bathrooms Zona
## Rooms Zona
## Zona Square.Meters, Bathrooms, Rooms
No problems with multicollinearity.
To make sure this model is better than the previous one, we can use an analysis of variance (ANOVA).
anova(final_linear_model,model_with_interaction)
## Analysis of Variance Table
##
## Model 1: Price ~ Square.Meters + Rooms + Bathrooms + Zona
## Model 2: Price ~ (Square.Meters * Bathrooms * Rooms) + Zona
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3213 358.24
## 2 3209 315.81 4 42.425 107.77 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is very small, indicating that the second model explains the data better.
To measure the accuracy of the models we will use the RMSE, the MAE and the \(R^2\).
accuracy <- function(pred, obs, na.rm = FALSE,
tol = sqrt(.Machine$double.eps),
row_name = NULL,
results_df = NULL) {
# Calculate errors
err <- obs - pred
if(na.rm) {
is.a <- !is.na(err)
err <- err[is.a]
obs <- obs[is.a]
}
# Calculate metrics
metrics <- c(
rmse = sqrt(mean(err^2)),
mae = mean(abs(err)),
r.squared = 1 - sum(err^2) / sum((obs - mean(obs))^2)
)
# Convert metrics to a data frame
metrics_df <- as.data.frame(t(metrics))
# Add row name if provided
if (!is.null(row_name)) {
rownames(metrics_df) <- row_name
}
# Append metrics to the results data frame if provided
if (!is.null(results_df)) {
results_df <- rbind(results_df, metrics_df)
} else {
results_df <- metrics_df
}
return(results_df)
}
obs<-test$Price
lm_prediction<-predict(model_with_interaction,newdata=test)
ab<-accuracy(lm_prediction,test$Price,na.rm = T,row_name = "Linear Regression" )
pred.plot(lm_prediction, obs, xlab = "Predicción", ylab = "Observado")
Ridge regression adds a penalty on the squared size of the coefficients to the least-squares objective. \[ \hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} \] Where \(\lambda\) is the penalty parameter. This type of model tends to keep all the variables while shrinking the size of their coefficients.
x_train<-(model.matrix(Price ~ ., data = train))[, -1]
y_train<- train$Price
x_test<-(model.matrix(Price ~ ., data = test))[, -1]
y_test<-test$Price
set.seed(1)
cv.ridge <- cv.glmnet(x_train, y_train, alpha = 0)
plot(cv.ridge)
cv.ridge$lambda.1se
## [1] 0.0987712
Our selected lambda (lambda.1se) is 0.0988.
coef(cv.ridge)
## 25 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 6.993662420
## Square.Meters 0.002536255
## Rooms 0.047488132
## Bathrooms 0.179370720
## ZonaBarajas -0.275217796
## ZonaCarabanchel -0.346419613
## ZonaCentro 0.107474561
## ZonaChamartín 0.068648020
## ZonaChamberí 0.194867586
## ZonaCiudad Lineal -0.188903310
## ZonaFuencarral - El Pardo -0.239033638
## ZonaHortaleza -0.179712904
## ZonaLatina -0.269942836
## ZonaMoncloa - Aravaca -0.084214931
## ZonaMoratalaz -0.403232153
## ZonaPardo -0.119121625
## ZonaPuente de Vallecas -0.415076422
## ZonaRetiro 0.110994306
## ZonaSalamanca 0.244993851
## ZonaSan Blas - Canillejas -0.367939788
## ZonaTetuán -0.043127245
## ZonaUsera -0.389398075
## ZonaVicálvaro -0.317010742
## ZonaVilla de Vallecas -0.369420984
## ZonaVillaverde -0.415570046
As can be seen, none of the coefficients are particularly large.
pred <- predict(cv.ridge, newx = x_test) # s = "lambda.1se"
ridge<-accuracy(pred, y_test,row_name = "Ridge regression");ridge
## rmse mae r.squared
## Ridge regression 0.3353355 0.2679801 0.6316163
performance<-rbind(ab,ridge)
LASSO replaces the squared penalty with an absolute-value penalty. \[ \hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \] LASSO can shrink some coefficients to exactly zero. This effectively removes certain features from the model, making it useful for feature selection.
cv.lasso<- cv.glmnet(x_train, y_train, alpha = 1)
plot(cv.lasso)
cv.lasso$lambda.1se
## [1] 0.01808242
coef(cv.lasso)
## 25 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 6.943187816
## Square.Meters 0.002905329
## Rooms .
## Bathrooms 0.214492280
## ZonaBarajas -0.005263230
## ZonaCarabanchel -0.202093999
## ZonaCentro 0.145768620
## ZonaChamartín 0.067491485
## ZonaChamberí 0.230704034
## ZonaCiudad Lineal -0.045164820
## ZonaFuencarral - El Pardo -0.039193100
## ZonaHortaleza -0.019478162
## ZonaLatina -0.114153467
## ZonaMoncloa - Aravaca .
## ZonaMoratalaz .
## ZonaPardo .
## ZonaPuente de Vallecas -0.143026455
## ZonaRetiro 0.117361005
## ZonaSalamanca 0.286424246
## ZonaSan Blas - Canillejas -0.209006690
## ZonaTetuán .
## ZonaUsera -0.239168580
## ZonaVicálvaro -0.062302102
## ZonaVilla de Vallecas -0.205297458
## ZonaVillaverde -0.257253325
Here the coefficients removed are exactly those that were non-significant in the linear regression model, so the final model keeps 19 variables.
pred <- predict(cv.lasso, newx = x_test)
lasso<-accuracy(pred, y_test,row_name = "Lasso");lasso
## rmse mae r.squared
## Lasso 0.3423582 0.2749201 0.6160249
performance<-rbind(performance,lasso)
Elastic Net combines the LASSO and Ridge penalties.
caret.glmnet <- train(x_train, y_train, method = "glmnet",
preProc = c("zv", "center", "scale"), tuneLength = 10,
trControl = trainControl(method = "cv", number = 10))
caret.glmnet
## glmnet
##
## 3238 samples
## 24 predictor
##
## Pre-processing: centered (24), scaled (24)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2914, 2914, 2915, 2915, 2915, 2913, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.1 0.0004158078 0.3350818 0.6518788 0.2599257
## 0.1 0.0009605700 0.3350818 0.6518788 0.2599257
## 0.1 0.0022190412 0.3350818 0.6518788 0.2599257
## 0.1 0.0051262730 0.3350615 0.6519340 0.2601036
## 0.1 0.0118423555 0.3350957 0.6520260 0.2607078
## 0.1 0.0273573772 0.3355603 0.6518362 0.2622981
## 0.1 0.0631990896 0.3380792 0.6500819 0.2670688
## 0.1 0.1459980940 0.3488168 0.6405680 0.2809946
## 0.1 0.3372745330 0.3811772 0.6037253 0.3137708
## 0.2 0.0004158078 0.3350877 0.6518664 0.2599069
## 0.2 0.0009605700 0.3350877 0.6518664 0.2599069
## 0.2 0.0022190412 0.3350877 0.6518676 0.2599094
## 0.2 0.0051262730 0.3350923 0.6519408 0.2602398
## 0.2 0.0118423555 0.3352439 0.6519592 0.2610456
## 0.2 0.0273573772 0.3363540 0.6510874 0.2635569
## 0.2 0.0631990896 0.3417516 0.6455568 0.2717741
## 0.2 0.1459980940 0.3624660 0.6180718 0.2951903
## 0.2 0.3372745330 0.4080538 0.5499976 0.3370574
## 0.3 0.0004158078 0.3350978 0.6518545 0.2598920
## 0.3 0.0009605700 0.3350978 0.6518545 0.2598920
## 0.3 0.0022190412 0.3350943 0.6518750 0.2599532
## 0.3 0.0051262730 0.3351390 0.6519206 0.2603825
## 0.3 0.0118423555 0.3354885 0.6517211 0.2614895
## 0.3 0.0273573772 0.3374738 0.6497550 0.2652097
## 0.3 0.0631990896 0.3468303 0.6376983 0.2776144
## 0.3 0.1459980940 0.3761497 0.5922956 0.3081332
## 0.3 0.3372745330 0.4219972 0.5347495 0.3493966
## 0.4 0.0004158078 0.3351007 0.6518514 0.2598784
## 0.4 0.0009605700 0.3351007 0.6518514 0.2598784
## 0.4 0.0022190412 0.3351092 0.6518703 0.2600126
## 0.4 0.0051262730 0.3352028 0.6518714 0.2605330
## 0.4 0.0118423555 0.3358049 0.6513631 0.2620334
## 0.4 0.0273573772 0.3389111 0.6477980 0.2671329
## 0.4 0.0631990896 0.3530272 0.6267536 0.2842229
## 0.4 0.1459980940 0.3898413 0.5615312 0.3200368
## 0.4 0.3372745330 0.4332329 0.5269656 0.3590144
## 0.5 0.0004158078 0.3351031 0.6518458 0.2598562
## 0.5 0.0009605700 0.3351031 0.6518458 0.2598562
## 0.5 0.0022190412 0.3351267 0.6518618 0.2600746
## 0.5 0.0051262730 0.3352824 0.6517952 0.2607027
## 0.5 0.0118423555 0.3361696 0.6509316 0.2626429
## 0.5 0.0273573772 0.3407377 0.6449880 0.2694845
## 0.5 0.0631990896 0.3589289 0.6159205 0.2903502
## 0.5 0.1459980940 0.3973219 0.5474109 0.3267314
## 0.5 0.3372745330 0.4441066 0.5267091 0.3680285
## 0.6 0.0004158078 0.3351091 0.6518409 0.2598605
## 0.6 0.0009605700 0.3351091 0.6518409 0.2598605
## 0.6 0.0022190412 0.3351472 0.6518482 0.2601377
## 0.6 0.0051262730 0.3353809 0.6516874 0.2608908
## 0.6 0.0118423555 0.3365810 0.6504194 0.2632992
## 0.6 0.0273573772 0.3428302 0.6415898 0.2720196
## 0.6 0.0631990896 0.3648122 0.6044508 0.2962481
## 0.6 0.1459980940 0.4031304 0.5381556 0.3320547
## 0.6 0.3372745330 0.4567186 0.5262713 0.3782436
## 0.7 0.0004158078 0.3351090 0.6518383 0.2598508
## 0.7 0.0009605700 0.3351107 0.6518381 0.2598649
## 0.7 0.0022190412 0.3351708 0.6518292 0.2602015
## 0.7 0.0051262730 0.3354885 0.6515661 0.2610911
## 0.7 0.0118423555 0.3370644 0.6497728 0.2640249
## 0.7 0.0273573772 0.3451076 0.6377246 0.2746755
## 0.7 0.0631990896 0.3708486 0.5919455 0.3019369
## 0.7 0.1459980940 0.4084697 0.5304748 0.3368140
## 0.7 0.3372745330 0.4712417 0.5254267 0.3899085
## 0.8 0.0004158078 0.3351104 0.6518347 0.2598421
## 0.8 0.0009605700 0.3351170 0.6518355 0.2598896
## 0.8 0.0022190412 0.3351975 0.6518057 0.2602672
## 0.8 0.0051262730 0.3356049 0.6514339 0.2613025
## 0.8 0.0118423555 0.3376250 0.6489749 0.2648216
## 0.8 0.0273573772 0.3474947 0.6335652 0.2774138
## 0.8 0.0631990896 0.3773437 0.5774159 0.3077672
## 0.8 0.1459980940 0.4129159 0.5260807 0.3408122
## 0.8 0.3372745330 0.4878927 0.5235124 0.4029263
## 0.9 0.0004158078 0.3351115 0.6518320 0.2598350
## 0.9 0.0009605700 0.3351236 0.6518323 0.2599146
## 0.9 0.0022190412 0.3352268 0.6517776 0.2603341
## 0.9 0.0051262730 0.3357272 0.6512949 0.2615302
## 0.9 0.0118423555 0.3382817 0.6479732 0.2657017
## 0.9 0.0273573772 0.3499519 0.6291315 0.2801228
## 0.9 0.0631990896 0.3835704 0.5628340 0.3132165
## 0.9 0.1459980940 0.4165865 0.5256772 0.3441245
## 0.9 0.3372745330 0.5067922 0.5179439 0.4174473
## 1.0 0.0004158078 0.3351126 0.6518292 0.2598296
## 1.0 0.0009605700 0.3351311 0.6518276 0.2599406
## 1.0 0.0022190412 0.3352594 0.6517444 0.2604037
## 1.0 0.0051262730 0.3358575 0.6511437 0.2617711
## 1.0 0.0118423555 0.3390209 0.6467949 0.2666697
## 1.0 0.0273573772 0.3525407 0.6242712 0.2828532
## 1.0 0.0631990896 0.3878177 0.5534166 0.3170117
## 1.0 0.1459980940 0.4206505 0.5250983 0.3476921
## 1.0 0.3372745330 0.5279847 0.4987109 0.4334937
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.005126273.
Here I tried many different combinations of the tuning parameters to see whether the results improved.
ggplot(caret.glmnet, highlight = TRUE)
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 9 values. Consider specifying shapes manually if you need
## that many have them.
## Warning: Removed 30 rows containing missing values or values outside the scale range
## (`geom_point()`).
pred <- predict(caret.glmnet, newdata = x_test)
elastic<-accuracy(pred, y_test,row_name = "Elastic Net")
performance<-rbind(performance,elastic)
rf<-randomForest(Price~.,data=train);rf
##
## Call:
## randomForest(formula = Price ~ ., data = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 0.1019787
## % Var explained: 67.97
plot(rf, main = "Random Forest")
The OOB error stabilizes after roughly 200 trees.
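Since the out-of-bag error flattens out, a smaller forest should perform similarly at a fraction of the cost; a quick sketch, using the same train split as above (the exact error will vary with the seed):

```r
set.seed(1)
rf_small <- randomForest(Price ~ ., data = train, ntree = 200)
# Out-of-bag mean squared error after 200 trees
rf_small$mse[200]
```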
varImpPlot(rf)
Random forest lets us see which variables are the most important. Square.Meters is the most important, followed by Bathrooms; these two variables account for most of the model. Then comes Zona (neighborhood), and finally Rooms appears to be the least important, possibly because the information contained in Rooms is already captured by Square.Meters.
pdp1 <- partial(rf, "Square.Meters")
p1 <- plotPartial(pdp1)
pdp2 <- partial(rf, c("Bathrooms"))
p2 <- plotPartial(pdp2)
pdp3 <- partial(rf, c("Rooms"))
p3 <- plotPartial(pdp3)
grid.arrange(p1, p2,p3, ncol = 3)
The partial dependence plots allow us to see how one particular variable relates to the target variable. In this case, increasing the surface of the apartments increases the price up to a certain point, where it levels off; this could be caused by a lack of data. Bathrooms shows the same trend. Finally, increasing the number of rooms increases the price, but beyond 7 rooms the price decreases; again, this could be due to a lack of data (I believe there is only one apartment with 10 rooms) or it could be an underlying trend.
pred <- predict(rf, newdata = test)
rf<-accuracy(pred, obs,row_name = "Random Forest");rf
## rmse mae r.squared
## Random Forest 0.3151583 0.2555943 0.6746139
performance<-rbind(performance,rf)
gam <- gam(
Price ~ s(Square.Meters,Bathrooms,Rooms)+ Zona,
data = train,
family = gaussian()
)
summary(gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Price ~ s(Square.Meters, Bathrooms, Rooms) + Zona
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.55815 0.02584 292.542 < 2e-16 ***
## ZonaBarajas -0.27433 0.08159 -3.362 0.000783 ***
## ZonaCarabanchel -0.28326 0.04507 -6.285 3.73e-10 ***
## ZonaCentro 0.20928 0.02804 7.463 1.09e-13 ***
## ZonaChamartín 0.14544 0.03390 4.291 1.83e-05 ***
## ZonaChamberí 0.26413 0.03027 8.725 < 2e-16 ***
## ZonaCiudad Lineal -0.13789 0.04193 -3.289 0.001017 **
## ZonaFuencarral - El Pardo -0.24541 0.05607 -4.377 1.24e-05 ***
## ZonaHortaleza -0.16130 0.04647 -3.471 0.000525 ***
## ZonaLatina -0.18936 0.04415 -4.289 1.85e-05 ***
## ZonaMoncloa - Aravaca 0.01431 0.03830 0.374 0.708644
## ZonaMoratalaz -0.39818 0.14044 -2.835 0.004608 **
## ZonaPardo -0.05117 0.06235 -0.821 0.411876
## ZonaPuente de Vallecas -0.40069 0.08380 -4.782 1.82e-06 ***
## ZonaRetiro 0.17702 0.03411 5.190 2.23e-07 ***
## ZonaSalamanca 0.29334 0.02987 9.820 < 2e-16 ***
## ZonaSan Blas - Canillejas -0.36123 0.04900 -7.373 2.12e-13 ***
## ZonaTetuán 0.01693 0.03350 0.505 0.613256
## ZonaUsera -0.33463 0.04876 -6.863 8.08e-12 ***
## ZonaVicálvaro -0.34632 0.07737 -4.476 7.88e-06 ***
## ZonaVilla de Vallecas -0.30178 0.04996 -6.041 1.71e-09 ***
## ZonaVillaverde -0.40584 0.05098 -7.962 2.35e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Square.Meters,Bathrooms,Rooms) 44.93 54.8 88.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.702 Deviance explained = 70.8%
## GCV = 0.096829 Scale est. = 0.094828 n = 3238
vis.gam(gam, view = c("Square.Meters", "Bathrooms"), plot.type = "persp",
theta = 30, phi = 30,
main = "3D Interaction of Square Meters and Bathrooms",
color = "heat", zlab = "Price")
vis.gam(gam, view = c("Square.Meters", "Bathrooms"), plot.type = "contour",
color = "topo", main = "2D Contour Plot of Square Meters and Bathrooms")
These two graphs show the interaction between the surface area and the number of bathrooms; the interaction comes through clearly in both the 3D surface and the contour plot.
pred <- predict(gam, newdata = test)
g<-accuracy(pred, obs,row_name = "GAM");g
## rmse mae r.squared
## GAM 0.3114544 0.2464323 0.6822173
performance<-rbind(performance,g)
pred.plot(pred, obs, xlab = "Predicción", ylab = "Observado")
trControl <- trainControl(method = "cv", number = 5)
caret.xgb <- train(Price ~ ., method = "xgbTree", data = train,
trControl = trControl, verbosity = 0)
caret.xgb
pred <- predict(caret.xgb, newdata = test)
xgb<-accuracy(pred, obs,row_name = "XGB");xgb
## rmse mae r.squared
## XGB 0.3029643 0.2410419 0.6993063
performance<-rbind(performance,xgb)
| rmse | mae | r.squared | |
|---|---|---|---|
| Linear Regression | 0.3080600 | 0.2458122 | 0.6891063 |
| Ridge regression | 0.3353355 | 0.2679801 | 0.6316163 |
| Lasso | 0.3423582 | 0.2749201 | 0.6160249 |
| Elastic Net | 0.3337781 | 0.2625992 | 0.6350301 |
| Random Forest | 0.3151583 | 0.2555943 | 0.6746139 |
| GAM | 0.3114544 | 0.2464323 | 0.6822173 |
| XGB | 0.3029643 | 0.2410419 | 0.6993063 |
After testing the different models, the results were quite similar. The XGBoost model had the highest R-squared, but the difference compared to the other models was very small. For this reason, the linear regression model seems like the better option: it is simple, easy to interpret, and computationally efficient.
In summary, the models were not able to fully explain the variations in the data. One of the main challenges was the high variability in the data, which made it difficult for the models to detect clear patterns. Another possible reason is the lack of important information, such as the condition or quality of the apartment, which can have a big effect on the rental price. Including this type of information in future models could help improve the predictions.